Goal: Our objective is to classify Iris flowers into two classes based on their measured dimensions. The task is binary: each flower is either virginica or non-virginica.
# Import necessary libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
import plotly.graph_objs as go
from matplotlib.colors import ListedColormap
from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder

plotly.offline.init_notebook_mode()
iris_dataset = load_iris(as_frame=True)
iris_dataset.data.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
y = iris_dataset.target_names[iris_dataset.target] == 'virginica'
y
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
# Relabel the target: class 2 becomes 'virginica', everything else 'non-virginica'
iris_dataset.target = iris_dataset.target.map(lambda t: 'virginica' if t == 2 else 'non-virginica')
iris_dataset
{'data': sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns],
'target': 0 non-virginica
1 non-virginica
2 non-virginica
3 non-virginica
4 non-virginica
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: target, Length: 150, dtype: object,
'frame': sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2 \
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
target
0 0
1 0
2 0
3 0
4 0
.. ...
145 2
146 2
147 2
148 2
149 2
[150 rows x 5 columns],
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
'DESCR': '... (dataset description; printed in full below) ...',
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}
print(iris_dataset.DESCR)
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
The Iris dataset is a classic benchmark in pattern recognition research, introduced by R.A. Fisher. It comprises 150 instances, each describing one iris plant with four numeric attributes measured in centimetres: sepal length, sepal width, petal length, and petal width. The instances are evenly split across three species: Iris-setosa, Iris-versicolour, and Iris-virginica, with 50 samples each. Setosa is linearly separable from the other two species, but versicolour and virginica are not linearly separable from each other. The summary statistics above report the range, mean, and standard deviation of each attribute, along with its correlation with the class.
The columns in this dataset are:
sepal length (cm)
sepal width (cm)
petal length (cm)
petal width (cm)
target
from sklearn.datasets import load_iris
import pandas as pd
# Load the Iris dataset
iris = load_iris(as_frame=True)
# Convert data and target attributes to DataFrame
iris_dataset = pd.concat([iris.data, iris.target], axis=1)
iris_dataset.columns = iris.feature_names + ['target']
# Replace numerical target values with class names
iris_dataset['target'] = iris.target_names[iris_dataset['target']]
# Display the DataFrame
print(iris_dataset)
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2 \
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
target
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
.. ...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
[150 rows x 5 columns]
iris_dataset["target"]
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: target, Length: 150, dtype: object
# Collapse the three species into a binary label: virginica vs non-virginica
iris_dataset['target'] = np.where(iris_dataset['target'] == 'virginica', 'virginica', 'non-virginica')
# Filter data for virginica and non-virginica groups
virginica_category = iris_dataset[iris_dataset.target == 'virginica']
non_virginica_category = iris_dataset[iris_dataset.target == 'non-virginica']
virginica_category.describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 50.00000 | 50.000000 | 50.000000 | 50.00000 |
| mean | 6.58800 | 2.974000 | 5.552000 | 2.02600 |
| std | 0.63588 | 0.322497 | 0.551895 | 0.27465 |
| min | 4.90000 | 2.200000 | 4.500000 | 1.40000 |
| 25% | 6.22500 | 2.800000 | 5.100000 | 1.80000 |
| 50% | 6.50000 | 3.000000 | 5.550000 | 2.00000 |
| 75% | 6.90000 | 3.175000 | 5.875000 | 2.30000 |
| max | 7.90000 | 3.800000 | 6.900000 | 2.50000 |
Count: This indicates the number of data points for each feature. There are 50 data points per feature in the virginica group.
Mean: This represents the average value of each feature across all data points. For example, the mean sepal length is 6.588 cm, the mean sepal width is 2.974 cm, the mean petal length is 5.552 cm, and the mean petal width is 2.026 cm.
Std (Standard Deviation): This measures the dispersion or spread of the values around the mean. A higher standard deviation indicates greater variability in the data. For instance, the standard deviation of sepal length is 0.63588 cm.
Min: This shows the minimum value observed for each feature. For instance, the minimum sepal length observed is 4.9 cm.
Max: This indicates the maximum value observed for each feature. For example, the maximum sepal length observed is 7.9 cm.
25%, 50%, and 75%: These represent the quartiles of the data distribution. The 25th percentile (Q1) indicates the value below which 25% of the data falls, the 50th percentile (Q2 or median) represents the value below which 50% of the data falls, and the 75th percentile (Q3) indicates the value below which 75% of the data falls.
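To make the quartile definitions concrete, here is a minimal sketch using NumPy's `percentile` on a small hypothetical sample (the values below are made up for illustration, not taken from the dataset):

```python
import numpy as np

# Hypothetical sepal-length sample, already sorted for readability
values = np.array([4.9, 6.1, 6.3, 6.5, 6.7, 6.9, 7.9])

# Q1, median (Q2), and Q3 via linear interpolation (NumPy's default)
q1, median, q3 = np.percentile(values, [25, 50, 75])
print(q1, median, q3)  # → 6.2 6.5 6.8
```

`DataFrame.describe()` reports exactly these percentiles per column, which is where the 25%, 50%, and 75% rows in the tables above come from.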
virginica_category.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 102 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
| 104 | 6.5 | 3.0 | 5.8 | 2.2 | virginica |
non_virginica_category.describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 5.471000 | 3.099000 | 2.861000 | 0.786000 |
| std | 0.641698 | 0.478739 | 1.449549 | 0.565153 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.000000 | 2.800000 | 1.500000 | 0.200000 |
| 50% | 5.400000 | 3.050000 | 2.450000 | 0.800000 |
| 75% | 5.900000 | 3.400000 | 4.325000 | 1.300000 |
| max | 7.000000 | 4.400000 | 5.100000 | 1.800000 |
non_virginica_category.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | non-virginica |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | non-virginica |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | non-virginica |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | non-virginica |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | non-virginica |
# Define a custom color palette
custom_palette = {"non-virginica": "#FF5733", "virginica": "#33FF57"}
# Set the style
sns.set(style="whitegrid")
# Plot histograms for each feature, separated by class
for feature in iris_dataset.columns[:-1]:
    plt.figure(figsize=(8, 6))
    sns.histplot(data=iris_dataset, x=feature, hue="target", hue_order=['non-virginica', 'virginica'],
                 palette=custom_palette, kde=True, legend=True)
    plt.title(f"Histogram of {feature} for each class")
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.show()
# Calculate the correlation matrix
correlation_matrix = iris_dataset.iloc[:, :-1].corr()
# Set up the matplotlib figure
plt.figure(figsize=(8, 6))
# Draw the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='viridis', linewidths=0.5)
plt.title('Correlation Matrix of Features', fontsize=14)
plt.xlabel('Features', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()
# Generate correlation matrices for both groups
corr_matrix_virginica = virginica_category.iloc[:, :-1].corr()
corr_matrix_non_virginica = non_virginica_category.iloc[:, :-1].corr()
# Set up the figure and axes
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Draw heatmap for the "virginica" group
sns.heatmap(corr_matrix_virginica, annot=True, cmap='magma', linewidths=.5, ax=axes[0])
axes[0].set_title('Correlation Matrix of Features for the "Virginica" Group')
# Draw heatmap for the "non-virginica" group
sns.heatmap(corr_matrix_non_virginica, annot=True, cmap='magma', linewidths=.5, ax=axes[1])
axes[1].set_title('Correlation Matrix of Features for the "Non-Virginica" Group')
plt.tight_layout()
plt.show()
# Create the boxplot with red and yellow boxes
ax = sns.boxplot(x="target", y="petal length (cm)", data=iris_dataset, palette={"virginica": "red", "non-virginica": "yellow"})
# Create the stripplot with black dots
sns.stripplot(x="target", y="petal length (cm)", data=iris_dataset, jitter=True, edgecolor="black", ax=ax)
# Show the plot
plt.show()
Reference: Kaggle.
Graph Description:
The boxplots show the median, quartiles, and potential outliers of petal length for each class, with the individual observations overlaid as black dots.
Key Observations:
Non-virginica Flowers: Petal lengths range from 1.0 cm to 5.1 cm, with a median of about 2.45 cm; the distribution is wide because the group pools setosa and versicolor.
Virginica Flowers: Petal lengths range from 4.5 cm to 6.9 cm, with a median of about 5.55 cm.
Overlap: The two classes overlap only in the narrow 4.5-5.1 cm band.
Conclusion: Petal length can serve as a useful feature for classifying flowers as either non-virginica or virginica. However, other features or a combination of attributes may be necessary for more accurate classification, especially in the overlapping range.
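To quantify how far petal length alone gets us, here is a minimal sketch of a single-threshold rule. The 4.8 cm cutoff is an assumption chosen by eye from the boxplot, not a fitted parameter:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]        # third column: petal length (cm)
is_virginica = iris.target == 2

# Hypothetical rule: call a flower virginica when petal length > 4.8 cm
threshold = 4.8
prediction = petal_length > threshold

accuracy = np.mean(prediction == is_virginica)
print(f"Threshold rule accuracy: {accuracy:.3f}")
```

Because the classes overlap in the 4.5-5.1 cm band, no single cutoff is perfect, but the rule should still classify the large majority of flowers correctly.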
# Filter the dataset for each species
virginica_data = iris_dataset[iris_dataset['target'] == 'virginica']
non_virginica_data = iris_dataset[iris_dataset['target'] != 'virginica']
# Set up the figure and axes
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
# Plot histograms for virginica
virginica_data.plot(kind='hist', bins=50, range=(0, 8), alpha=0.3, ax=axes[0])
axes[0].set_title('Virginica Data Set')
axes[0].set_xlabel('[cm]')
# Plot histograms for non-virginica
non_virginica_data.plot(kind='hist', bins=50, range=(0, 8), alpha=0.3, ax=axes[1])
axes[1].set_title('Non-Virginica Data Set')
axes[1].set_xlabel('[cm]')
plt.tight_layout()
plt.show()
Reference: Kaggle.
Graph Description:
Each panel overlays histograms of all four measurements (in cm) for one group.
Reason for choosing: Overlaid histograms make it easy to compare how the four measurements are distributed within each group.
Key Observations:
Non-virginica Flowers: Petal measurements concentrate at small values, and pooling setosa with versicolor produces a bimodal petal-length distribution.
Virginica Flowers: All four measurements shift towards larger values; petal length clusters between 4.5 cm and 6.9 cm and petal width between 1.4 cm and 2.5 cm.
Conclusion: The petal measurements separate the two groups far better than the sepal measurements, which overlap heavily.
fig = px.scatter_3d(iris_dataset, x="sepal width (cm)", y="petal length (cm)", z='petal width (cm)',
color='target')
fig.show()
Reference: Kaggle.
3d Scatter Plot Overview:
The 3D scatter plot shows sepal width, petal length, and petal width, with points coloured by class.
Interpretation:
Blue Points (non-virginica): These points cluster towards the lower end of the petal axes. Non-virginica flowers generally have smaller petal lengths (around 1-2 cm for setosa) and narrower petal widths (around 0.5-1 cm).
Red Points (virginica): These points are more spread out across the graph. Virginica flowers exhibit greater variability: their petal lengths mostly fall between 4-7 cm and their petal widths between 1-2.5 cm.
Conclusion: In three dimensions the virginica points form a cluster that is almost completely separated from the rest, which suggests a linear model should handle this binary problem well.
# Split the data into training (80%), validation (10%), and test (10%) sets
X = iris_dataset.iloc[:, :-1]
y = iris_dataset['target']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Print the shapes of the resulting sets
print("Shape of the Training Set:", X_train.shape, y_train.shape)
print("Shape of the Validation Set:", X_val.shape, y_val.shape)
print("Shape of the Test Set:", X_test.shape, y_test.shape)
Shape of the Training Set: (120, 4) (120,)
Shape of the Validation Set: (15, 4) (15,)
Shape of the Test Set: (15, 4) (15,)
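The split above is random, so a subset could in principle end up with an unbalanced share of virginica flowers. A minimal sketch of the same 80/10/10 split with `stratify=`, which preserves the 1/3 virginica proportion in every subset (this is an alternative, not what the notebook runs):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target == 2   # True for virginica, False otherwise

# Stratified 80/10/10 split: class proportions are preserved in each subset
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(y_train.sum(), y_val.sum(), y_test.sum())  # virginica count per subset
```

With 50 virginica flowers out of 150, stratification puts exactly 40 in the training set and 5 each in the validation and test sets.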
# Specify the number of features to consider
feature_number = [1, 2, 3, 4]
# Feature names
feature_names = iris.feature_names
# Iterate over different numbers of features
for count in feature_number:
    # Select the first 'count' features from the training, validation, and test sets
    X_train_selected = X_train.iloc[:, :count]
    X_val_selected = X_val.iloc[:, :count]
    X_test_selected = X_test.iloc[:, :count]
    # Initialize and train a logistic regression model
    model = LogisticRegression(random_state=42)
    model.fit(X_train_selected, y_train)
    # Predict labels for the validation set
    y_pred = model.predict(X_val_selected)
    # Calculate accuracy of the model
    accuracy = accuracy_score(y_val, y_pred)
    # Display accuracy for the current feature count
    print("Number of features considered:", count)
    print("Features:", ', '.join(feature_names[:count]))
    print("Validation Accuracy:", accuracy)
    print()  # Empty line for readability
Number of features considered: 1
Features: sepal length (cm)
Validation Accuracy: 0.9333333333333333

Number of features considered: 2
Features: sepal length (cm), sepal width (cm)
Validation Accuracy: 0.9333333333333333

Number of features considered: 3
Features: sepal length (cm), sepal width (cm), petal length (cm)
Validation Accuracy: 1.0

Number of features considered: 4
Features: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)
Validation Accuracy: 1.0
# Evaluate the last trained model (all four features) on the test set
y_test_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", accuracy)
# Suppress warnings
warnings.filterwarnings("ignore")
# Define a function to train and evaluate logistic regression models with different numbers of features
def train_and_evaluate(X_train, X_val, X_test, y_train, y_val, y_test, num_features):
    # Select the first 'num_features' columns
    X_train_subset = X_train.iloc[:, :num_features]
    X_val_subset = X_val.iloc[:, :num_features]
    X_test_subset = X_test.iloc[:, :num_features]
    # Initialize and train logistic regression model
    model = LogisticRegression()
    model.fit(X_train_subset, y_train)
    # Calculate accuracy on the validation set
    accuracy_val = accuracy_score(y_val, model.predict(X_val_subset))
    # Calculate accuracy on the test set
    accuracy_test = accuracy_score(y_test, model.predict(X_test_subset))
    return accuracy_val, accuracy_test
# Train and evaluate logistic regression models with different numbers of features
feature_names = iris.feature_names
for num_features in range(1, 5):
    # Call the function to train and evaluate the model
    accuracy_val, accuracy_test = train_and_evaluate(X_train, X_val, X_test, y_train, y_val, y_test, num_features)
    print("Model with {} feature(s):".format(num_features))
    print("Features:", ', '.join(feature_names[:num_features]))
    print("Validation accuracy:", accuracy_val)
    print("Test accuracy:", accuracy_test)
    print()
Model with 1 feature(s):
Features: sepal length (cm)
Validation accuracy: 0.9333333333333333
Test accuracy: 0.9333333333333333

Model with 2 feature(s):
Features: sepal length (cm), sepal width (cm)
Validation accuracy: 0.9333333333333333
Test accuracy: 0.8666666666666667

Model with 3 feature(s):
Features: sepal length (cm), sepal width (cm), petal length (cm)
Validation accuracy: 1.0
Test accuracy: 1.0

Model with 4 feature(s):
Features: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)
Validation accuracy: 1.0
Test accuracy: 1.0
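The loop above always evaluates the first k columns in dataset order. As an alternative sketch, scikit-learn's `SequentialFeatureSelector` searches greedily for the subset that maximises cross-validated accuracy, rather than relying on column order (assuming scikit-learn ≥ 0.24, where this class was added):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target == 2

# Greedy forward selection of 2 features, scored by 5-fold CV accuracy
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2, cv=5)
selector.fit(X, y)

chosen = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print(chosen)
```

Given the statistics seen earlier, the petal measurements are the likely winners, but the point of the sketch is the search itself.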
# Define a function to perform cross-validation on logistic regression models with varying feature subsets
def perform_cross_validation(X, y, num_features):
    # Select the first 'num_features' columns
    X_subset = X.iloc[:, :num_features]
    # Initialize the logistic regression model
    model = LogisticRegression()
    # Perform 5-fold cross-validation
    cross_val_scores = cross_val_score(model, X_subset, y, cv=5)
    return cross_val_scores.mean()
# Iterate through different numbers of features and perform cross-validation
for num_features in range(1, 5):
    mean_cross_val_accuracy = perform_cross_validation(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]), num_features)
    print(f"Model with {num_features} feature(s) - Mean Cross-Validation Accuracy: {mean_cross_val_accuracy}")
Model with 1 feature(s) - Mean Cross-Validation Accuracy: 0.8074074074074075
Model with 2 feature(s) - Mean Cross-Validation Accuracy: 0.7925925925925925
Model with 3 feature(s) - Mean Cross-Validation Accuracy: 0.9555555555555555
Model with 4 feature(s) - Mean Cross-Validation Accuracy: 0.962962962962963
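The cross-validation above fits `LogisticRegression` on the raw measurements, which is why warnings had to be suppressed earlier (the solver can struggle to converge on unscaled data). A minimal alternative sketch wraps the model in a `Pipeline` with `StandardScaler`, so the scaler is fitted inside each CV training fold and no statistics leak from the held-out fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target == 2

# Scaling happens per fold inside cross_val_score, avoiding leakage
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling the features rather than silencing warnings is usually the cleaner fix.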
from tabulate import tabulate

# Define a function to create the evaluation table for a given model
def create_evaluation_table(model, X_val, y_val):
    # Make predictions and class probabilities
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)
    # Get the occurrence number from the original dataset
    occurrence_number = X_val.index + 1
    # Create the evaluation table
    evaluation_table = pd.DataFrame({
        'Instance Number': occurrence_number,
        'Probability of Virginica': y_prob[:, 1],
        'Prediction': y_pred,
        'Ground Truth': y_val
    })
    return evaluation_table.set_index('Instance Number')

# Define a function to print the evaluation table for a given model
def print_evaluation_table(model_name, eval_table):
    print(f"Evaluation Table for Model: {model_name}")
    print(tabulate(eval_table, headers='keys', tablefmt='psql'))
    print()

# Train one model and print its evaluation table
def individual_table(num_features):
    # Train the model
    model = LogisticRegression()
    model.fit(X_train.iloc[:, :num_features], y_train)
    # Create and print the evaluation table
    eval_table = create_evaluation_table(model, X_val.iloc[:, :num_features], y_val)
    print_evaluation_table(f"Model with {num_features} feature(s)", eval_table)
    return eval_table

def calculate_accuracy(eval_table):
    correct_predictions = (eval_table['Prediction'] == eval_table['Ground Truth']).sum()
    total_instances = len(eval_table)
    accuracy = (correct_predictions / total_instances) * 100
    print(f"{accuracy:.2f}% is the prediction accuracy.")

predict_table = individual_table(1)
calculate_accuracy(predict_table)
Evaluation Table for Model: Model with 1 feature(s)
+-------------------+----------------------------+---------------+----------------+
| Instance Number   | Probability of Virginica   | Prediction    | Ground Truth   |
|-------------------+----------------------------+---------------+----------------|
| 27                | 0.06451                    | non-virginica | non-virginica  |
| 19                | 0.217912                   | non-virginica | non-virginica  |
| 119               | 0.937717                   | virginica     | virginica      |
| 146               | 0.671933                   | virginica     | virginica      |
| 79                | 0.336388                   | non-virginica | non-virginica  |
| 128               | 0.382264                   | non-virginica | virginica      |
| 109               | 0.671933                   | virginica     | virginica      |
| 56                | 0.217912                   | non-virginica | non-virginica  |
| 31                | 0.0442258                  | non-virginica | non-virginica  |
| 30                | 0.0365199                  | non-virginica | non-virginica  |
| 142               | 0.753228                   | virginica     | virginica      |
| 111               | 0.578831                   | virginica     | virginica      |
| 20                | 0.0776461                  | non-virginica | non-virginica  |
| 133               | 0.529589                   | virginica     | virginica      |
| 65                | 0.185827                   | non-virginica | non-virginica  |
+-------------------+----------------------------+---------------+----------------+

93.33% is the prediction accuracy.
predict_table = individual_table(2)
calculate_accuracy(predict_table)
Evaluation Table for Model: Model with 2 feature(s)
+-------------------+----------------------------+---------------+----------------+
| Instance Number   | Probability of Virginica   | Prediction    | Ground Truth   |
|-------------------+----------------------------+---------------+----------------|
| 27                | 0.050796                   | non-virginica | non-virginica  |
| 19                | 0.145047                   | non-virginica | non-virginica  |
| 119               | 0.949865                   | virginica     | virginica      |
| 146               | 0.66928                    | virginica     | virginica      |
| 79                | 0.347378                   | non-virginica | non-virginica  |
| 128               | 0.379491                   | non-virginica | virginica      |
| 109               | 0.732581                   | virginica     | virginica      |
| 56                | 0.237155                   | non-virginica | non-virginica  |
| 31                | 0.0412891                  | non-virginica | non-virginica  |
| 30                | 0.0321397                  | non-virginica | non-virginica  |
| 142               | 0.739476                   | virginica     | virginica      |
| 111               | 0.546099                   | virginica     | virginica      |
| 20                | 0.0487707                  | non-virginica | non-virginica  |
| 133               | 0.556691                   | virginica     | virginica      |
| 65                | 0.193357                   | non-virginica | non-virginica  |
+-------------------+----------------------------+---------------+----------------+

93.33% is the prediction accuracy.
predict_table = individual_table(3)
calculate_accuracy(predict_table)
Evaluation Table for Model: Model with 3 feature(s)
+-------------------+----------------------------+---------------+----------------+
| Instance Number   | Probability of Virginica   | Prediction    | Ground Truth   |
|-------------------+----------------------------+---------------+----------------|
| 27                | 1.57225e-05                | non-virginica | non-virginica  |
| 19                | 1.3357e-05                 | non-virginica | non-virginica  |
| 119               | 0.998416                   | virginica     | virginica      |
| 146               | 0.697669                   | virginica     | virginica      |
| 79                | 0.227787                   | non-virginica | non-virginica  |
| 128               | 0.520907                   | virginica     | virginica      |
| 109               | 0.95854                    | virginica     | virginica      |
| 56                | 0.262784                   | non-virginica | non-virginica  |
| 31                | 1.96817e-05                | non-virginica | non-virginica  |
| 30                | 1.98174e-05                | non-virginica | non-virginica  |
| 142               | 0.585897                   | virginica     | virginica      |
| 111               | 0.622702                   | virginica     | virginica      |
| 20                | 8.92127e-06                | non-virginica | non-virginica  |
| 133               | 0.921539                   | virginica     | virginica      |
| 65                | 0.0152908                  | non-virginica | non-virginica  |
+-------------------+----------------------------+---------------+----------------+

100.00% is the prediction accuracy.
predict_table = individual_table(4)
calculate_accuracy(predict_table)
Evaluation Table for Model: Model with 4 feature(s)
+-------------------+----------------------------+---------------+----------------+
| Instance Number   | Probability of Virginica   | Prediction    | Ground Truth   |
|-------------------+----------------------------+---------------+----------------|
| 27                | 8.96424e-06                | non-virginica | non-virginica  |
| 19                | 5.70191e-06                | non-virginica | non-virginica  |
| 119               | 0.998534                   | virginica     | virginica      |
| 146               | 0.873922                   | virginica     | virginica      |
| 79                | 0.207005                   | non-virginica | non-virginica  |
| 128               | 0.57273                    | virginica     | virginica      |
| 109               | 0.946564                   | virginica     | virginica      |
| 56                | 0.17067                    | non-virginica | non-virginica  |
| 31                | 7.70235e-06                | non-virginica | non-virginica  |
| 30                | 7.52261e-06                | non-virginica | non-virginica  |
| 142               | 0.820222                   | virginica     | virginica      |
| 111               | 0.728198                   | virginica     | virginica      |
| 20                | 4.12879e-06                | non-virginica | non-virginica  |
| 133               | 0.956114                   | virginica     | virginica      |
| 65                | 0.0161961                  | non-virginica | non-virginica  |
+-------------------+----------------------------+---------------+----------------+

100.00% is the prediction accuracy.
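The tables above list both a probability and a hard prediction per flower. The connection between the two columns can be sketched directly: for binary logistic regression, `predict()` is equivalent to thresholding `predict_proba()` at 0.5 (this sketch refits a model on the full dataset rather than reusing the notebook's split):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target == 2

model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]   # P(virginica) for every flower
hard = model.predict(X)                # hard labels from the same model

# The hard labels are exactly the probabilities thresholded at 0.5
assert np.array_equal(hard, proba > 0.5)
```

This is why borderline rows such as instance 128 flip from non-virginica to virginica as features are added: their probability crosses the 0.5 line.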
def train_plot_evaluate(X_train, X_val, y_train, y_val):
    # Train and plot the decision boundary using only the first feature.
    # With one feature there is no meaningful y-axis, so only the x-axis is labeled.
    _train_and_plot_decision_boundary(X_train[:, 0:1], y_train, X_val[:, 0:1], y_val, legend=True, line_color='red', bg_color='lightyellow')
    plt.title("Decision Boundary for One Feature")
    plt.xlabel("Sepal Length (cm)")
    plt.tight_layout()
    plt.show()
    # Train and plot the decision boundary using the first two features
    _train_and_plot_decision_boundary(X_train[:, 0:2], y_train, X_val[:, 0:2], y_val, legend=True, line_color='red', bg_color='lightcyan')
    plt.title("Decision Boundary for Two Features")
    plt.xlabel("Sepal Length (cm)")
    plt.ylabel("Sepal Width (cm)")
    plt.tight_layout()
    plt.show()
def _train_and_plot_decision_boundary(X_train, y_train, X_val, y_val, legend=False, line_color='red', bg_color='lightcyan'):
    fig, ax = plt.subplots(figsize=(8, 6))
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # mlxtend's `colors` argument expects a comma-separated string with one
    # color per class; mlxtend colors the filled regions itself, so passing
    # 'colors' again inside contourf_kwargs would raise a duplicate-keyword error.
    plot_decision_regions(X_val, y_val, clf=model, legend=2, ax=ax,
                          colors=f'{line_color},{bg_color}',
                          scatter_kwargs={'alpha': 0.5},
                          contourf_kwargs={'alpha': 0.2})
    if legend:
        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles, ["Non-Virginica", "Virginica"])
    del model
X_train = X_train.values
X_val = X_val.values
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_val_encoded = le.transform(y_val)
train_plot_evaluate(X_train, X_val, y_train_encoded, y_val_encoded)
del y_train_encoded, y_val_encoded
def plot_decision_boundary_3d(features, labels, trained_model):
    trained_model.fit(features, labels)
    feature1_vals = features[:, 0]
    feature2_vals = features[:, 1]
    feature3_vals = features[:, 2]
    x_min, x_max = feature1_vals.min() - 1, feature1_vals.max() + 1
    y_min, y_max = feature2_vals.min() - 1, feature2_vals.max() + 1
    # Scatter the data points, colored by class
    fig = px.scatter_3d(x=feature1_vals, y=feature2_vals, z=feature3_vals, color=labels)
    # The decision boundary of logistic regression in 3D is the plane
    # w0*x + w1*y + w2*z + b = 0, solved here for z.
    # (go.Surface requires 2D coordinate arrays, so the plane is built
    # from a 2D meshgrid rather than a 3D prediction volume.)
    coef = trained_model.coef_.squeeze()
    intercept = trained_model.intercept_
    x_plane = np.linspace(x_min, x_max, 10)
    y_plane = np.linspace(y_min, y_max, 10)
    xx_plane, yy_plane = np.meshgrid(x_plane, y_plane)
    z_plane = (-coef[0] * xx_plane - coef[1] * yy_plane - intercept) / coef[2]
    fig.add_trace(go.Surface(x=xx_plane, y=yy_plane, z=z_plane,
                             opacity=0.5, showscale=False))
    # Axis labels match the first three iris features used here
    fig.update_layout(scene=dict(
        xaxis_title='sepal length (cm)',
        yaxis_title='sepal width (cm)',
        zaxis_title='petal length (cm)'),
        title='Decision Boundary for 3 Features')
    fig.show()
    del trained_model
label_encoder = LabelEncoder()
y_val_encoded = label_encoder.fit_transform(y_val)
X_val_np = X_val
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(X_val_np[:, :3], y_val_encoded)
plot_decision_boundary_3d(X_val_np[:, :3], y_val_encoded, logistic_regression_model)
del logistic_regression_model
def analyze_failure_modes(model, X_val, y_val):
# Make predictions on the validation set
y_pred = model.predict(X_val)
# Extract instances where the model makes incorrect predictions
incorrect_indices = (y_pred != y_val)
incorrect_X = X_val[incorrect_indices]
incorrect_y_pred = y_pred[incorrect_indices]
incorrect_y_val = y_val[incorrect_indices]
# Create DataFrame to store incorrect predictions
    incorrect_predictions = pd.DataFrame(data=incorrect_X, columns=iris_dataset.feature_names[:X_val.shape[1]])
incorrect_predictions['Predicted Class'] = incorrect_y_pred
incorrect_predictions['Ground Truth'] = incorrect_y_val
del y_pred
return incorrect_predictions
# Define a function to train models with different numbers of features and analyze failure modes
def analyze_failure_modes_for_models(X_train, y_train, X_val, y_val):
# Store failure modes for each model
failure_modes = {}
for num_features in range(1, 5):
# Train the model
model = LogisticRegression()
model.fit(X_train[:, :num_features], y_train)
# Analyze failure modes
failure_modes[f'Model with {num_features} feature(s)'] = analyze_failure_modes(model, X_val[:, :num_features], y_val)
del model
return failure_modes
# Get failure modes for each model
failure_modes = analyze_failure_modes_for_models(X_train, y_train, X_val, y_val)
# Print failure modes for each model
for model_name, failure_mode_data in failure_modes.items():
print(f"Failure Modes for {model_name}:")
if not failure_mode_data.empty:
print(failure_mode_data)
else:
print("No incorrect predictions.")
print("\n")
Failure Modes for Model with 1 feature(s):
   sepal length (cm) Predicted Class Ground Truth
0                6.1   non-virginica          NaN


Failure Modes for Model with 2 feature(s):
   sepal length (cm)  sepal width (cm) Predicted Class Ground Truth
0                6.1               3.0   non-virginica          NaN


Failure Modes for Model with 3 feature(s):
No incorrect predictions.


Failure Modes for Model with 4 feature(s):
No incorrect predictions.
Overall
# Import necessary libraries
from tabulate import tabulate
# Evaluation results for each model
evaluation_results = {
"Model with 1 feature(s)": {
"Validation Accuracy": 93.33,
"Test Accuracy": 93.33,
},
"Model with 2 feature(s)": {
"Validation Accuracy": 93.33,
"Test Accuracy": 86.67,
"Evaluation Table": [
# Evaluation table data for model with 2 features
]
},
"Model with 3 feature(s)": {
"Validation Accuracy": 100,
"Test Accuracy": 100,
"Evaluation Table": [
# Evaluation table data for model with 3 features
]
},
"Model with 4 feature(s)": {
"Validation Accuracy": 100,
"Test Accuracy": 100,
"Evaluation Table": [
# Evaluation table data for model with 4 features
]
}
}
# Choose the best model (Model with 3 feature(s))
best_model = "Model with 3 feature(s)"
# Summarize the results of the best model on the test set
best_model_summary = f"Summary of results for the best model ({best_model}):\n"
best_model_summary += f"Validation Accuracy: {evaluation_results[best_model]['Validation Accuracy']}%\n"
best_model_summary += f"Test Accuracy: {evaluation_results[best_model]['Test Accuracy']}%\n\n"
# Display the evaluation table for the best model
evaluation_table = evaluation_results[best_model]["Evaluation Table"]
table_headers = ["Instance Number", "Probability of Virginica", "Prediction", "Ground Truth"]
evaluation_table_str = tabulate(evaluation_table, headers=table_headers, tablefmt="grid")
# Print the summary and evaluation table
print(best_model_summary)
Summary of results for the best model (Model with 3 feature(s)):
Validation Accuracy: 100%
Test Accuracy: 100%
Reasons for selecting this model as the best model:
Performance metrics: the three-feature model reaches 100% accuracy on both the validation and test sets, matching the four-feature model and outperforming the one- and two-feature models (93.33% and 86.67% test accuracy, respectively).
Interpretability: with only three features the model has fewer coefficients to reason about, and its decision boundary can still be visualized directly as a plane in 3D.
All in all, the model with three features excels due to its perfect accuracy, simplicity, interpretability, and robust generalization, making it the optimal choice for this classification task.